InstaHide: Instance-hiding Schemes for Private Distributed Learning
How can multiple distributed entities collaboratively train a shared deep net
on their private data while preserving privacy? This paper introduces
InstaHide, a simple encryption scheme for training images that can be plugged into
existing distributed deep learning pipelines. The encryption is efficient, and
applying it during training has only a minor effect on test accuracy.
InstaHide encrypts each training image with a "one-time secret key" which
consists of mixing a number of randomly chosen images and applying a random
pixel-wise mask. Other contributions of this paper include: (a) Using a large
public dataset (e.g., ImageNet) for mixing during encryption, which improves
security. (b) Experimental results showing effectiveness in preserving privacy
against known attacks with only minor effects on accuracy. (c) Theoretical
analysis showing that successfully attacking privacy requires attackers to
solve a difficult computational problem. (d) Demonstrating that use of the
pixel-wise mask is important for security, since Mixup alone is shown to be
insecure against some efficient attacks. (e) Release of a challenge dataset at
https://github.com/Hazelsuko07/InstaHide_Challenge
Our code is available at https://github.com/Hazelsuko07/InstaHide
Comment: ICML 2020
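As described in the abstract, the encryption mixes each private image with several randomly chosen (e.g., public) images and then applies a random pixel-wise mask. The following is a minimal sketch of that recipe, assuming images are float arrays in a common range; the function and parameter names (instahide_encrypt, public_pool, k) are illustrative, not the authors' reference implementation.

```python
import numpy as np

def instahide_encrypt(private_image, public_pool, k=4, rng=None):
    """Sketch of InstaHide-style encryption of one training image (H, W, C array)."""
    rng = np.random.default_rng() if rng is None else rng
    # Pick k-1 images to mix in (drawn here from a public pool, per contribution (a)).
    idx = rng.choice(len(public_pool), size=k - 1, replace=False)
    images = [private_image] + [public_pool[i] for i in idx]
    # Random mixing coefficients that sum to 1 (part of the one-time secret key).
    coeffs = rng.dirichlet(np.ones(k))
    mixed = sum(c * img for c, img in zip(coeffs, images))
    # Random pixel-wise +/-1 sign mask (the other part of the one-time secret key).
    mask = rng.choice([-1.0, 1.0], size=mixed.shape)
    return mask * mixed
```

A fresh set of mix partners, coefficients, and mask is drawn for every image and every epoch, which is what makes the key "one-time"; label mixing and normalization details are omitted here.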
Matching-based Data Valuation for Generative Model
Data valuation is critical in machine learning, as it helps enhance model
transparency and protect data properties. Existing data valuation methods have
primarily focused on discriminative models, neglecting deep generative models
that have recently gained considerable attention. As with discriminative
models, there is an urgent need to assess data contributions in deep generative
models as well. However, previous data valuation approaches have mainly relied on
discriminative model performance metrics and required model retraining.
Consequently, they cannot be applied directly and efficiently to recent deep
generative models, such as generative adversarial networks and diffusion
models, in practice. To bridge this gap, we formulate the data valuation
problem in generative models from a similarity-matching perspective.
Specifically, we introduce Generative Model Valuator (GMValuator), the first
model-agnostic approach applicable to any generative model, designed to provide data
valuation for generation tasks. We have conducted extensive experiments to
demonstrate the effectiveness of the proposed method. To the best of our
knowledge, GMValuator is the first work that offers a training-free, post-hoc
data valuation strategy for deep generative models.
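The abstract frames valuation as similarity matching between generated samples and training data. The sketch below is one plausible reading of that idea, nearest-neighbor credit assignment in an embedding space; the function name and the specific matching rule are assumptions for illustration, not the GMValuator algorithm itself.

```python
import numpy as np

def similarity_matching_values(train_feats, gen_feats):
    """Illustrative similarity-matching valuation.

    train_feats: (n, d) embeddings of training data.
    gen_feats:   (m, d) embeddings of generated samples.
    Each generated sample credits its closest training datum; a datum's
    value is its accumulated (normalized) credit. No retraining is needed,
    which is the training-free, post-hoc property the abstract highlights.
    """
    values = np.zeros(len(train_feats))
    for g in gen_feats:
        dists = np.linalg.norm(train_feats - g, axis=1)
        values[np.argmin(dists)] += 1.0  # credit the closest training point
    return values / len(gen_feats)
```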
Privacy Implications of Retrieval-Based Language Models
Retrieval-based language models (LMs) have demonstrated improved
interpretability, factuality, and adaptability compared to their parametric
counterparts, by incorporating retrieved text from external datastores. While
it is well known that parametric models are prone to leaking private data, it
remains unclear how the addition of a retrieval datastore impacts model
privacy. In this work, we present the first study of privacy risks in
retrieval-based LMs, particularly kNN-LMs. Our goal is to explore the optimal
design and training procedure in domains where privacy is of concern, aiming to
strike a balance between utility and privacy. Crucially, we find that kNN-LMs
are more susceptible to leaking private information from their private
datastore than parametric models. We further explore mitigations of privacy
risks. When private information is targeted and readily detected in the text,
we find that a simple sanitization step would completely eliminate the risks,
while decoupling query and key encoders achieves an even better utility-privacy
trade-off. Otherwise, we consider strategies of mixing public and private data
in both datastore and encoder training. While these methods offer modest
improvements, they leave considerable room for future work. Together, our
findings provide insights for practitioners to better understand and mitigate
privacy risks in retrieval-based LMs. Our code is available at:
https://github.com/Princeton-SysML/kNNLM_privacy
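For context, a kNN-LM interpolates the parametric model's next-token distribution with a distribution built from nearest neighbors in a datastore of (context embedding, next token) pairs, so leakage can come from that datastore, and the sanitization mitigation amounts to filtering it before indexing. A minimal sketch, assuming standard kNN-LM interpolation; the names (knn_lm_probs, sanitize_datastore) and the toy filtering rule are illustrative, not the paper's exact procedure.

```python
import numpy as np

def knn_lm_probs(p_lm, query, datastore_keys, datastore_values,
                 vocab_size, k=8, lam=0.25, temperature=1.0):
    """Interpolate a parametric LM distribution with a kNN distribution.

    p_lm:             (vocab_size,) next-token probabilities from the LM.
    query:            (d,) embedding of the current context.
    datastore_keys:   (N, d) stored context embeddings.
    datastore_values: (N,) stored next-token ids (numpy int array).
    """
    dists = np.linalg.norm(datastore_keys - query, axis=1)
    nn = np.argsort(dists)[:k]                      # k nearest datastore entries
    weights = np.exp(-dists[nn] / temperature)
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for w, token in zip(weights, datastore_values[nn]):
        p_knn[token] += w                           # mass on retrieved tokens
    return lam * p_knn + (1.0 - lam) * p_lm

def sanitize_datastore(texts, is_private):
    """Toy sanitization: drop entries flagged as containing targeted private
    information before building datastore keys/values."""
    return [t for t in texts if not is_private(t)]
```

Because retrieved tokens come verbatim from the datastore, any private string stored there can surface in generation, which is why filtering the datastore (or decoupling the query and key encoders) directly shapes the utility-privacy trade-off discussed above.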
Sparsity-Preserving Differentially Private Training of Large Embedding Models
As the use of large embedding models in recommendation systems and language
applications increases, concerns over user data privacy have also risen.
DP-SGD, a training algorithm that combines differential privacy with stochastic
gradient descent, has been the workhorse in protecting user privacy without
compromising model accuracy by much. However, applying DP-SGD naively to
embedding models can destroy gradient sparsity, leading to reduced training
efficiency. To address this issue, we present two new algorithms, DP-FEST and
DP-AdaFEST, that preserve gradient sparsity during private training of large
embedding models. Our algorithms achieve substantial reductions
in gradient size, while maintaining comparable levels of accuracy, on benchmark
real-world datasets.
Comment: Neural Information Processing Systems (NeurIPS) 2023
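To illustrate the sparsity issue the abstract describes: naive DP-SGD adds Gaussian noise to every row of the embedding table, including rows the batch never touched, so the privatized gradient becomes dense. The sketch below contrasts that with a simplified variant that perturbs only selected rows; it is in the spirit of, but not identical to, DP-FEST/DP-AdaFEST, and all function and parameter names are hypothetical.

```python
import numpy as np

def naive_dp_embedding_grad(grad_rows, touched, vocab_size, dim, clip, sigma, rng):
    """Naive DP-SGD: scatter the clipped per-batch gradient into the full
    (vocab_size, dim) table, then add noise to *every* row -> dense output."""
    g = np.zeros((vocab_size, dim))
    g[touched] = grad_rows                      # rows actually used by the batch
    return g + rng.normal(0.0, sigma * clip, size=g.shape)

def sparsity_preserving_dp_grad(grad_rows, touched, dim, clip, sigma, rng):
    """Simplified sparsity-preserving variant: add noise only to a selected set
    of rows and return a sparse (indices, values) gradient. Note: selecting
    exactly the touched rows is not private by itself; the paper's algorithms
    choose the retained rows in a differentially private way."""
    noisy = grad_rows + rng.normal(0.0, sigma * clip, size=(len(touched), dim))
    return touched, noisy
```

Keeping the gradient in a sparse (indices, values) form is what recovers the training-efficiency benefits that dense noise destroys for very large embedding tables.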